library(tidyverse)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(forcats)
library(ggraph)
library(igraph)
options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())A network (also sometimes referred to as a graph) is a set of relationships. Networks contain a set of objects (nodes/verticies) and a mapping or description of relations between the nodes (link/edge). For example, a simple network contains two objects, 1 and 3, and one relationship, or edge, that links them:
The above edge is an undirected edge - there is no directionality to the relationship between 1 and 2. A directed edge is an ordered pair of nodes, with an arrow drawn to indicate the directionality of the edge:
There are lots of ways to analyze and measure networks. Let’s first look at two examples of network analysis applied to real-life data, then circle back and talk about packages for network visualization and analysis in R.
The Rise of Partisanship in the U.S. House of Representatives
There is little argument that political polarization is occurring in the United States. Partisanship can be attributed to many causes, including:
In this paper, the authors studying partisanship in the U.S. House of Representatives by examining relationships between legislators. On one hand, legislators are pressured by party leaders to vote with members of their own party with incentives such as committee assignments and campaign funding used to keep members in check. Alternatively, legislators have individual incentives to cooperate with members of the opposite political party (responsiveness to individual constituencies’ concerns).
Here, the authors measure the extent to which legislators form ideological relationships with members of the opposite party by examining cooperation rates between individual members of Congress on roll-call votes. In this research design, legislators are the nodes and the frequency of agreement on roll call votes are the edges. Edges are calculated for each pair of legislators serving in a two-year term of Congress, resulting in nearly 6,000,000 pairs of legislators. The data is represented in matrix form with the rows and columns identifying each legislator-term observation and the cells identifying the frequency of voting together in that term. Therefore the network/graph is undirected (it doesn’t matter the ordering of the pair, they will still have the same number of votes in agreement with one another).
Pairs are defined as either cross-party (comprised of a single Republican and a single Democrat) or same-party (comprised of two Democrats or two Republicans). We can then think of partisan affiliation as an attribute of each node, and the cross- or same-party pairing being an attribute of each edge.
This figure from the article is a fairly typical visualization of network data known as a node-edge diagram. Nodes are drawn as point marks and the edges connecting them are drawn as line marks. Drawing node-edge diagrams is tricky for the simple reason that you need to determine where to draw each node on a two-dimensional coordinate plane. The problem is that there is no inherent or meaningful value of x-y coordinates. In this data, there is no attribute/variable that tells us where to draw these nodes.
Instead, node-edge diagrams rely on algorithms to determine spatial positioning in the visualization. There are many algorithms that can perform this task, but they typically consider connectivity and distance. Distance is defined as the number of edges along the shortest path connecting two nodes. Nodes that are tightly connected by shorter distances would therefore be grouped closer together in the visualization. Nodes that are loosely connected by many edges or links should be farther apart on the graph.
One of the most common network layouts is force-directed placement.
In one variant, network elements are positioned according to a simulation of physical forces where nodes push away from each other while links act like springs that draw the endpoint nodes closer to each other. Typically this method places nodes randomly within the spatial region and iteratively refines the locations by gradually shifting nodes around until the layout improves and stabilizes.
Under this method, absolute spatial position does not encode any meaningful value. The algorithm is designed to minimize distracting artifacts that might confuse the viewer (e.g. edge crossings, node overlaps), so spatial location is merely a side effect. While absolute position is meaningless, relative spatial location can be meaningful. Tightly interconnected groups of nodes should be drawn relatively close together, which could indicate a substantive clustering. However this could also be an artifact of the algorithm. Alternative measures such as centrality are more robust measures of relative node importance in a network.
Because these algorithms start with random placement, in order to exactly reproduce a network visualization you should remember to set your random number seed at the beginning of the script (
set.seed()in R).
In this graph, edges are drawn between legislators who agree above the Congress’ threshold value of votes - this is defined as the average level of cooperation within each Congress. If the authors did not do this, every pair with at least one vote in agreement would have an edge drawn, making the graph even more complex.
We also see that the authors encode additional information in the visualization through extra channels:
Hairball diagrams are those with so many nodes and edges that the diagram becomes a jumbled mess and interpretation becomes extremely difficult. A general rule of thumb is that if the number of nodes is more than 4 times the number of edges, straight-forward force-directed placement will not be optimal.
In this example, the individual network diagrams are pretty bad if you want to interpret them at the legislator level. For instance, consider the 2011 graph:
Trying to track all the nodes and edges to identify legislators with the most ties (cooperative pairings) is downright impossible. There is just too much going on here. However, the visualization is very good at depicting the increasing partisanship in the U.S. House of Representatives. Democrats and Republicans are clustered together on the graph (partially the algorithm and partially the fact that legislators vote most frequently with members of their own party). It is easy to see that in the 1950s and 60s there were a lot of edges connecting legislators from both sides of the aisle. In fact we can even see more mixing of the nodes, where Republicans and Democrats are drawn more closely together on the grid. Over time, we can see both the number of cross-party edges decreasing and the spatial distance between the core Democrat and Republican clusters increasing, both outcomes of increased partisanship and decreasing cooperation.
Example adapted from David Robinson’s Analyzing networks of characters in ‘Love Actually’
First we use R to parse the raw script into a tidy data frame.
raw <- readLines("data/love_actually.txt")
lines <- data_frame(raw = raw) %>%
filter(raw != "", !str_detect(raw, "(song)")) %>%
mutate(is_scene = str_detect(raw, " Scene "),
scene = cumsum(is_scene)) %>%
filter(!is_scene) %>%
separate(raw, c("speaker", "dialogue"), sep = ":", fill = "left") %>%
group_by(scene, line = cumsum(!is.na(speaker))) %>%
summarize(speaker = speaker[1], dialogue = str_c(dialogue, collapse = " "))Next we match characters to actors:
cast <- read_csv(url("http://varianceexplained.org/files/love_actually_cast.csv"))
lines <- lines %>%
inner_join(cast) %>%
mutate(character = paste0(speaker, " (", actor, ")"))
lines[460:465,] %>%
knitr::kable(caption = "Example of tidied data")| scene | line | speaker | dialogue | actor | character |
|---|---|---|---|---|---|
| 52 | 581 | Sarah | Nobody’s trying to kill you, babe. | Laura Linney | Sarah (Laura Linney) |
| 52 | 582 | Sarah | Thank you. Don’t do that, my darling. Thank you. Don’t do that. | Laura Linney | Sarah (Laura Linney) |
| 53 | 583 | Harry | Right. Back at three. Christmas shopping, never an easy or a pleasant task. | Alan Rickman | Harry (Alan Rickman) |
| 53 | 584 | Mia | Are you going to get me something? | Heike Makatsch | Mia (Heike Makatsch) |
| 53 | 585 | Harry | Er… I don’t know, I hadn’t thought. Where’s Sarah, by the way? | Alan Rickman | Harry (Alan Rickman) |
| 53 | 586 | Mia | She couldn’t make it in today. Family thing. | Heike Makatsch | Mia (Heike Makatsch) |
In order to analyze the network structure, we need to count first the lines-per-scene-per-character, then convert this into a binary speaker-by-scene matrix:
by_speaker_scene <- lines %>%
count(scene, character)
by_speaker_scene## Source: local data frame [162 x 3]
## Groups: scene [?]
##
## # A tibble: 162 x 3
## scene character n
## <int> <chr> <int>
## 1 2 Billy (Bill Nighy) 5
## 2 2 Joe (Gregor Fisher) 3
## 3 3 Jamie (Colin Firth) 5
## 4 4 Daniel (Liam Neeson) 3
## 5 4 Karen (Emma Thompson) 6
## 6 5 Colin (Kris Marshall) 4
## 7 6 Jack (Martin Freeman) 2
## 8 6 Judy (Joanna Page) 1
## 9 7 Mark (Andrew Lincoln) 4
## 10 7 Peter (Chiwetel Ejiofor) 4
## # ... with 152 more rows
library(reshape2)
speaker_scene_matrix <- by_speaker_scene %>%
acast(character ~ scene, fun.aggregate = length)
dim(speaker_scene_matrix)## [1] 20 76
Hierarchical clustering is a technique for identifying clusters or subgroups in a data set. Dendrograms are a common visualization of these clusters. Observations which fuse into branches lower on the graph are generally similar to one another, whereas observations which fuse higher are less similar.
norm <- speaker_scene_matrix / rowSums(speaker_scene_matrix)
h <- hclust(dist(norm, method = "manhattan"))
ggdendro::ggdendrogram(h)This looks about right! Almost all the romantic pairs are together (Natalia/PM; Aurelia/Jamie, Harry/Karen; Karl/Sarah; Juliet/Peter; Jack/Judy) as are the friends (Colin/Tony; Billy/Joe) and family (Daniel/Sam).
One thing this tree is perfect for is giving an ordering that puts similar characters close together:
ordering <- h$labels[h$order]
ordering## [1] "Natalie (Martine McCutcheon)" "PM (Hugh Grant)"
## [3] "Aurelia (Lúcia Moniz)" "Jamie (Colin Firth)"
## [5] "Daniel (Liam Neeson)" "Sam (Thomas Sangster)"
## [7] "Jack (Martin Freeman)" "Judy (Joanna Page)"
## [9] "Colin (Kris Marshall)" "Tony (Abdul Salis)"
## [11] "Billy (Bill Nighy)" "Joe (Gregor Fisher)"
## [13] "Mark (Andrew Lincoln)" "Juliet (Keira Knightley)"
## [15] "Peter (Chiwetel Ejiofor)" "Karl (Rodrigo Santoro)"
## [17] "Sarah (Laura Linney)" "Mia (Heike Makatsch)"
## [19] "Harry (Alan Rickman)" "Karen (Emma Thompson)"
This ordering can be used to make other graphs more informative. For instance, we can visualize a timeline of all scenes:
scenes <- by_speaker_scene %>%
filter(n() > 1) %>% # scenes with > 1 character
ungroup() %>%
mutate(scene = as.numeric(factor(scene)),
character = factor(character, levels = ordering))
ggplot(scenes, aes(scene, character)) +
geom_point() +
geom_path(aes(group = scene))Alternatively, network data can be visualized using a matrix view based on a table from the original network data. For example, an adjacency matrix is a matrix where the nodes of the network are lined up on the rows and columns, and the edges are encoded in the cell values. To visualize the edges, we draw a heatmap with color used to encode the edges.
Here we draw an adjacency matrix for the Love Actually data. To generate the adjacency matrix, all we need is the speaker-by-scene matrix. Recall that this matrix records whether a character is present in every scene using a binary 0/1 coding scheme. To generate an adjacency matrix, we simply multiply the matrix by its transpose. This identifies the number of scenes in which every pair of characters appear.
non_airport_scenes <- speaker_scene_matrix[, colSums(speaker_scene_matrix) < 10]
non_airport_scenes[1:5, 1:5]## 2 3 4 5 6
## Aurelia (Lúcia Moniz) 0 0 0 0 0
## Billy (Bill Nighy) 1 0 0 0 0
## Colin (Kris Marshall) 0 0 0 1 0
## Daniel (Liam Neeson) 0 0 1 0 0
## Harry (Alan Rickman) 0 0 0 0 0
cooccur <- non_airport_scenes %*% t(non_airport_scenes)
cooccur[1:5, 1:5]## Aurelia (Lúcia Moniz) Billy (Bill Nighy)
## Aurelia (Lúcia Moniz) 5 0
## Billy (Bill Nighy) 0 6
## Colin (Kris Marshall) 0 0
## Daniel (Liam Neeson) 0 0
## Harry (Alan Rickman) 0 0
## Colin (Kris Marshall) Daniel (Liam Neeson)
## Aurelia (Lúcia Moniz) 0 0
## Billy (Bill Nighy) 0 0
## Colin (Kris Marshall) 6 0
## Daniel (Liam Neeson) 0 11
## Harry (Alan Rickman) 0 0
## Harry (Alan Rickman)
## Aurelia (Lúcia Moniz) 0
## Billy (Bill Nighy) 0
## Colin (Kris Marshall) 0
## Daniel (Liam Neeson) 0
## Harry (Alan Rickman) 10
To visualize the adjacency matrix, we draw a heatmap. Note that when using the default heatmap() function from the stats library, dendrograms are also added to the left and top sides of the visualization:
heatmap(cooccur)If you want to use ggplot(), you need to first convert the adjacency matrix into a tidy data frame, then plot that object (and you lose the dendrograms):
cooccur %>%
as_tibble %>%
mutate(id1 = rownames(cooccur)) %>%
gather(id2, n, -id1) %>%
mutate_at(vars(id1, id2), funs(factor(., levels = ordering))) %>%
ggplot(aes(id1, id2, fill = n)) +
geom_tile() +
scale_fill_continuous(low = "white", high = "red") +
coord_fixed() +
labs(x = NULL,
y = NULL,
fill = NULL) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))To draw the network as a node-edge diagram, first we need to convert the adjacency matrix into an igraph object.1 Rather than storing it in a pure matrix form, the igraph object is a specialized data object in R for storing network data. Many graphing functions in R for network data rely on this structure.2
g <- graph_from_adjacency_matrix(cooccur,
weighted = TRUE,
mode = "undirected",
diag = FALSE)
g## IGRAPH UNW- 20 37 --
## + attr: name (v/c), weight (e/n)
## + edges (vertex names):
## [1] Aurelia (Lúcia Moniz)--Jamie (Colin Firth)
## [2] Billy (Bill Nighy) --Joe (Gregor Fisher)
## [3] Colin (Kris Marshall)--Mark (Andrew Lincoln)
## [4] Colin (Kris Marshall)--Tony (Abdul Salis)
## [5] Daniel (Liam Neeson) --Karen (Emma Thompson)
## [6] Daniel (Liam Neeson) --Sam (Thomas Sangster)
## [7] Harry (Alan Rickman) --Jamie (Colin Firth)
## [8] Harry (Alan Rickman) --Karen (Emma Thompson)
## + ... omitted several edges
The first argument is the adjacency matrix, and the remaining arguments are:
weighted = TRUE - creates a weighted graph. Here the weights refer to the fact that some edges are larger than others (i.e. some characters appear together in more scenes than others). In an unweighted graph, edges are binary - either they exist between two nodes, or they do not.mode = "undirected" - there is no directionality to the edges.diag = FALSE - set the diagonal of the matrix to be zero. We want to ignore the fact that characters appear in different numbers of scenes overall (this is basically ignoring nodes linked to themselves, i.e. the raw number of scenes in which the character appears).To draw the node-edge diagram, we can use several packages. Here I use ggraph, an extension of ggplot2 which incorporates relational data structures into the layered grammar of graphics foundation.
ggraph(g) +
geom_edge_link(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none")Technical details about the visualization:
nicely() algorithm from igraph. We will examine the output of additional algorithms later.A few patterns pop out of this visualization. We see that the majority of characters are tightly connected (often by the scenes at the school play, or by Karen (Emma Thompson), who is friends or family to many key characters). But we see Bill Nighy’s plotline occurs almost entirely separate from everyone else, and that five other characters are linked to the main network by only a single thread (Sarah’s conversation with Mark at the wedding).
ggraph for network visualizationsLet’s explore in more detail the different components for drawing graphs using ggraph. Again, this is not the only approach in R. In fact, the package was only published in the past few months and is therefore still building in support and usage. However it is pretty thoroughly documented and adheres to the grammar of graphics philosophy. Other packages for network visualization in R include:
igraph - the workhorse package in R. ggraph and other visualization tools heavily rely on this packages core functions for generating layouts, creating data structures, and other backend processing tasks. igraph also includes loads of functions for statistical analysis of networks (measures of density, reciprocity, centrality, distance, etc.)ggnet2 - another package using the ggplot2 syntaxggnetwork - yet another package using ggplot2, in fact from the same author of ggnet2. This package retains the same common functions in ggplot2 to add geom_()s to the visualization. It most resembles standard ggplot2 code.The layout defines the vertical and horizontal placement of nodes when plotting a graph structure. The layout algorithm takes in a graph structure and returns the vertical and horizontal position of the nodes. Different algorithms (and in fact different iterations) return different spatial positions:
ggraph(g) +
geom_edge_link(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none") +
ggtitle("Default layout (Nicely) algorithm")ggraph(g, layout = "kk") +
geom_edge_link(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none") +
ggtitle("Kamada and Kawai spring-based algorithm")ggraph(g, layout = "fr") +
geom_edge_link(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none") +
labs(title = "Fruchterman-Reingold algorithm",
subtitle = "Force-directed layout, 500 iterations")ggraph(g, layout = "fr", niter = 100) +
geom_edge_link(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none") +
labs(title = "Fruchterman-Reingold algorithm",
subtitle = "Force-directed layout, 100 iterations")ggraph(g, layout = "fr", niter = 10) +
geom_edge_link(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none") +
labs(title = "Fruchterman-Reingold algorithm",
subtitle = "Force-directed layout, 10 iterations")ggraph(g, layout = "grid") +
geom_edge_link(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none") +
ggtitle("Grid algorithm")ggraph(g, layout = "star") +
geom_edge_link(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none") +
ggtitle("Star algorithm")ggraph(g, layout = "linear") +
geom_edge_arc(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none") +
ggtitle("Linear algorithm")ggraph(g, layout = "linear", circular = TRUE) +
geom_edge_arc(aes(edge_width = weight)) +
geom_node_point() +
geom_node_text(aes(label = name), repel = TRUE, size = 3) +
scale_edge_width_continuous(range = c(.5, 3)) +
theme_graph() +
theme(legend.position = "none") +
ggtitle("Star algorithm (circular)")igraph